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Models for Paired Comparison Data 
A Review with Emphasis on 
Dependent Data 
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Abstract. Thurstonian and Bradley-Terry models are the most com- 
monly applied models in the analysis of paired comparison data. Since 
their introduction, numerous developments have been proposed in dif- 
ferent areas. This paper provides an updated overview of these ex- 
tensions, including how to account for object- and subject-specific co- 
variates and how to deal with ordinal paired comparison data. Special 
emphasis is given to models for dependent comparisons. Although these 
models are more realistic, their use is complicated by numerical difficul- 
ties. We therefore concentrate on implementation issues. In particular, 
a pairwise likelihood approach is explored for models for dependent 
paired comparison data, and a simulation study is carried out to com- 
pare the performance of maximum pairwise likelihood with other lim- 
ited information estimation methods. The methodology is illustrated 
throughout using a real data set about university paired comparisons 
performed by students. 

Key words and phrases: Bradley-Terry model, limited information es- 
timation, paired comparisons, pairwise likelihood, Thurstonian models. 



1. INTRODUCTION 

Paired comparison data originate from the com- 
parison of objects in couples. This type of data arises 
in numerous contexts, especially when the judgment 
of a person is involved. Indeed, it is easier for peo- 
ple to compare pairs of objects than ranking a list 
of items. There are other situations that may be 
regarded as comparisons from which a winner and 
a loser can be identified without the presence of 
a judge. Both these instances can be analyzed by 
the techniques described in this paper. 



Manuela Cattelan is Postdoctoral Research Fellow, 
Department of Statistical Sciences, University of Padua, 
via C. Battisti 241, 35121 Padova, Italy e-mail: 
manuela. cattelan@unipd.it. 



This is an electronic reprint of the original article 
published by the Institute of Mathematical Statistics in 
Statistical Science, 2012, Vol. 27, No. 3, 412-433. This 
reprint differs from the original in pagination and 
typographic detail. 



The objects involved in the paired comparisons 
can be beverages, carbon typewriter ribbons, lot- 
teries, players, moral values, physical stimuli and 
many more. Here, the elements that are compared 
are called objects or sometimes stimuli. The paired 
comparisons can be performed by a person, an agent, 
a consumer, a judge, et cetera, so the terms subject 
or judge will be employed to denote the person that 
makes the choice. 

The bibliography by Davidson and Farquhar (1976), 
which includes more than 350 papers related to paired 
comparison data, testifies to the widespread inter- 
est in this type of data. This interest is still present 
and extensions of models for paired comparison data 
have been proposed. This paper focuses on recent 
extensions of the two traditional models, the Thur- 
stone (1927) and the Bradley-Terry (Bradley and 
Terry (1952)) model, especially those subsequent to 
the review by Bradley (1976) and the monograph by 
David (1988), including in particular the work that 
has been done in the statistical and the psychomet- 
ric literature. 
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Section 2 reviews models for independent data. 
After the introduction of the two classical models 
for the analysis of paired comparison data and a sm'- 
vey of different areas of application, Sections 2.3 
and 2.4 review extensions for ordinal paired com- 
parison data and for inclusion of explanatory vari- 
ables. Section 3 reviews models that allow for depen- 
dence among the observations and outlines the infer- 
ential problems related to such an extension. Here, 
a pairwise likelihood approach is proposed to esti- 
mate these models, and a simulation study is per- 
formed in order to compare the estimates produced 
by maximum likelihood, a common type of limited 
information estimation and pairwise likelihood. Sec- 
tion 4 reviews existing R (R Development Core Team 
(2011)) packages for the statistical analysis of paired 
comparison data, and Section 5 concludes. 

2. INDEPENDENT DATA 
2.1 Traditional Models 

Let Ysij denote the random variable associated 
with the result of the paired comparison between 
objects i and j , j > i = 1, . . . ,n, made by subject 
s = l,...,S, and let = {¥^12, . ■ .,Ysn-in) be the 
vector of the results of all paired comparisons made 
by subject s. When 5 = 1 or the difference between 
judges is not accounted for in the model, then the 
subscript s will be dropped. If each possible paired 
comparison is performed, they number = n{n — 
1 ) /2 , and SN = Sn{n — 1 ) /2 in a multiple j udgment 
sampling scheme, that is, when all paired compar- 
isons are made by all S subjects. Different sampling 
schemes are possible. When each paired comparison 
is performed by a different subject, the outcomes 
are independent. In other instances, a subject per- 
forms more than one paired comparison; in this case, 
it is conceivable that results of several paired com- 
parisons performed by the same subject will not be 
independent. In Section 2, independence among ob- 
servations is assumed while Section 3 addresses the 
issue of dependent data, assuming that each subject 
performs all N paired comparisons, except for Sec- 
tion 3.3 which considers the case of dependence not 
induced by judges. 

Let Hi G'R, i = 1, . . . ,n, denote the notional worth 
of the objects. Traditional models were developed 
assuming only two possible outcomes of each com- 
parison, so Yij is a binary random variable, and nij, 
the probability that object i is preferred to object j, 
depends on the difference between the worth of the 



two objects 

(2.1) TTij = F{fli- fij), 

where F is the cumulative distribution function of 
a zero-symmetric random variable. Such models are 
called linear models by David (1988). When F is the 
normal cumulative distribution function, formula (2.1) 
defines the Thurstone (1927) model, while if F is the 
logistic cumulative distribution function, then the 
Bradley-Terry model (Bradley and Terry (1952)) is 
recovered. Other specifications are possible; for ex- 
ample. Stern (1990) suggests modeling the worth 
parameters as independent gamma variables with 
the same shape parameter and different scale pa- 
rameter. The Thurstone model is also known as the 
Thurstone-Mosteller model since Hosteller (1951) 
presented some inferential techniques for the model, 
while the Bradley-Terry model was independently 
proposed also by Zermelo (1929) and Ford (1957). 
Model (2.1) is called unstructured model, and the 
aim of the analysis is to make inference on the vec- 
tor n = (//I, . . . , Hn)' of worth parameters which can 
be used to determine a final ranking of all the ob- 
jects compared. Note that the specification of model 
(2.1), through all the pairwise differences fii — Hj, 
implies that a constraint is needed in order to iden- 
tify the parameters. Various constraints can be spec- 
ified: the most common are the sum constraint, 
Y17=i ~ ^'^d the reference object constraint, 
^j-i = for one object i £ {1, . . . ,n}. 

The comparative nature of the data poses inferen- 
tial and interpretational problems. Consider two dif- 
ferent studies, for example, about beverages. If sub- 
jects were requested to express an absolute measure 
of like/dislike for each drink in a categorical scale, 
then the data obtained from the two studies might 
be analyzed all together. On the contrary, if the sub- 
jects express preferences in paired comparisons, the 
data can be combined only if at least one object 
is common to both studies; otherwise the data can 
be analyzed separately, and no conclusions can be 
made about relationships between objects in the two 
different studies. Indeed, the lack of origin implies 
that no absolute statement can be made about the 
data and two subjects can provide the same sets 
of preferences, but one may dislike all items while 
the other may like all of them. The identification 
of an origin may be useful for understanding the 
underlying psychological process, for discriminating 
between desirable and undesirable objects and for 
identifying the degree of an option desirability in 
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different conditions. However, it is not possible to re- 
cover the origin without further choice experiments 
and/or further assumptions (Thurstone and Jones 
(1957); Bockenholt (2004)). Despite ah their hm- 
its, paired comparison data are widespread because 
of their ease of performance and their discrimina- 
tory abihty since objects that may be judged in 
the same hke/dishke category may be differentiated 
when compared pairwise. 

If the reference object constraint is employed, the 
identified worth parameters are differences with re- 
spect to the reference object. Hence, inference will 
typically regard differences between estimated worth 
parameters with the related statistical problems. For 
example, for testing Hq : = fij by means of the 
Wald test statistic (/i, — /ij)/{vBx(/ii — /ij)}-^/^, where 
fti is the maximum likelihood estimator of fit, the 
covariance between the estimators of the worth pa- 
rameters is needed. In general, the whole covariance 
matrix of the worth parameters should be reported 
in order to allow the final users to perform the tests 
they are interested in. However, it is very inconve- 
nient to report that matrix and a useful alterna- 
tive may be to report quasi-standard errors (Firth 
and de Menezes (2004)) instead of the usual stan- 
dard errors since they allow approximate inference 
on any of the contrasts. Let c be a vector of zero- 
sum constants. If the parameters /x were indepen- 
dent, then the estimated standard error of c' fi would 
be (X]r=i where Vi denotes the estimated 
variance of fn. Quasi- variances are a vector of con- 
stants q such that 

n 

1=1 

so they have the property that they add over the 
components of yu, and hence can be used to ap- 
proximate variances of contrasts of estimated worth 
parameters as if they were independent. Let p{qi + 
qj,vai{fli — fij)), be a penalty function which de- 
pends on the quasi- variances and the estimated vari- 
ance of the difference fn — fij , then quasi- variances 
are computed through minimization of the sum of 
the penalty function for all contrasts; see Firth and 
de Menezes (2004, Section 2.1). 

Further statistical problems arising from the com- 
parative nature of the data are discussed in Sec- 
tion 3.2.2. 

Example. A program supported by the Euro- 
pean Union offers an international degree in Eco- 
nomics and Management. Twelve universities take 



part in this program, and in order to receive a de- 
gree, a student in the program must spend a semester 
at another university taking part in the program. 
Usually, some universities receive more preferences 
than others, and this may cause organizational prob- 
lems. A study was carried out among 303 students 
of the Vienna University of Economics who were 
asked in which university they would prefer to spend 
the period abroad, between six universities situated 
in Barcelona (Escuela Superior de Administracion 
y Direccion de Empresas), London (London School 
of Economics and Political Sciences), Milan (Uni- 
versita Luigi Bocconi), Paris (Hautes Etudes Com- 
merciales), St. Gallen (Hochschule St. Gallen) and 
Stockholm (Stockholm School of Economics), com- 
pared pairwise. This example will be used through- 
out the paper as an illustration. For an exhaustive 
analysis of the data refer to Dittrich, Hatzinger and 
Katzenbeisser (1998, 2001). The data set is avail- 
able in both the pref mod (Hatzinger (2010)) and the 
BradleyTerry2 (Turner and Firth (2010a)) R pack- 
ages; see Section 4. Table 1 reports the aggregated 
data on the 15 paired comparisons. For example, the 
first row shows that in the paired comparison be- 
tween London and Paris, 186 students prefer Lon- 
don, 91 students prefer Paris and 26 students do 
not have a preference between the two universities. 
Moreover, 91 students unintentionally overlooked the 
comparison between Paris and Milan which has only 
212 answers. The second column of Table 2 shows 

Table 1 

Universities paired comparison data. 1 and 2 refer 
to the number of choices in favor of the university 
in the fist and the second column, respectively, while X 
denotes the number of no preferences expressed 







1 


X 


2 


London 


Paris 


186 


26 


91 


London 


Milan 


221 


26 


56 


Paris 


Milan 


121 


32 


59 


London 


St. Gallen 


208 


22 


73 


Paris 


St. Gallen 


165 


19 


119 


Milan 


St. Gallen 


135 


28 


140 


London 


Barcelona 


217 


19 


67 


Paris 


Barcelona 


157 


37 


109 


Milan 


Barcelona 


104 


67 


132 


St. Gallen 


Barcelona 


144 


25 


134 


London 


Stockholm 


250 


19 


34 


Paris 


Stockholm 


203 


30 


70 


Milan 


Stockholm 


157 


46 


100 


St. Gallen 


Stockholm 


155 


50 


98 


Barcelona 


Stockholm 


172 


41 


90 
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Table 2 

Estimates (Est.), standard errors (S.E.) and quasi-standard 
errors (Q.S.E.) of the universities worth parameters 
employing a two-categorical Thurstone model (Thurstone) 
and a cumulative extension of the Thurstone model 
(cumulative Thurstone) 



Thurstone cumulative Thurstone 





Est. 


S.E. 


Q.S.E. 


Est. 


S.E. 


Q.S.E. 


Barcelona 


0.333 


0.043 


0.030 


0.332 


0.041 


0.028 


London 


0.982 


0.045 


0.033 


0.998 


0.043 


0.031 


Milan 


0.240 


0.044 


0.031 


0.241 


0.041 


0.029 


Paris 


0.561 


0.044 


0.031 


0.566 


0.042 


0.030 


St. Gallen 


0.325 


0.043 


0.030 


0.324 


0.040 


0.028 


Stockholm 







0.031 







0.029 


T2 








0.153 


0.007 





the estimate of the worth parameters for the six 
universities using the Thurstone model and adding 
half of the number of no preferences to each univer- 
sity in the paired comparison. In Section 2.3 a bet- 
ter way to handle no preference data will be dis- 
cussed. 

The reference object constraint is used, and the 
worth parameter of Stockholm is set to zero. All es- 
timates are positive, so we can conclude that Stock- 
holm is the least preferred university, while London 
is the most preferred one, followed by Paris, Barce- 
lona, St. Gallen and Milan. The estimated probabil- 
ity that London is preferred to Paris is $(0,982 — 
0.561) = 0.66, where <I> denotes the cumulative dis- 
tribution function of a standard normal random vari- 
able. If it is of interest to test whether the worth of 
St. Gallen is significantly higher than the worth of 
Milan, the standard error of the difference between 
these two worth parameters can be approximated 
by means of the quasi-standard errors as (0.030^ -|- 
0.0312)1/2 ^ Q_Q43 Quasi-standard errors are lower 
than standard errors, thus accounting for the pos- 
itive covariance between parameter estimates. The 
value of the test statistic is (0.325 - 0.240)/0.043 = 
1.98, which yields a p- value of 0.02; hence the hy- 
pothesis of equal worth parameters between St. Gal- 
len and Milan is not supported by the data. 

2.2 Applications 

There are many different areas in which paired 
comparison data arise. Here, a number of recent ap- 
plications are described, and further references can 
be found in Bradley (1976), Davidson and Farquhar 
(1976) and David (1988). 



Despite its simplicity, the basic Bradley-Terry and 
Thurstone models have found a wide range of ap- 
plications. Choisel and Wickelmaier (2007) analyze 
pairwise evaluations of sounds through a standard 
Bradley-Terry model, while Bauml (1994) and Kis- 
sler and Bauml (2000) present applications involv- 
ing facial attractiveness. In Mazzucchi, Linzey and 
Bruning (2008) the standard Bradley-Terry model 
is applied to a reliability problem. A panel of wiring 
experts is asked to state which is the riskier one be- 
tween different scenarios compared pairwise in order 
to determine the probability of wire failure as a func- 
tion of infiuencing factors in an aircraft environ- 
ment. Stigler (1994) uses the traditional Bradley- 
Terry model for ranking scientific journals, and the 
same model is exploited in genetics by Sham and 
Curtis (1995). 

Maydeu-Olivares and Bockenholt (2008) list 10 
reasons to use Thurstone's model for analyzing sub- 
jective health outcomes, including the ease for re- 
spondents, the existence of extensions for modeling 
inconsistent choices and for including covariates and 
the possibility to investigate which aspects infiuence 
the choices of subjects. 

In many applications there are more than two pos- 
sible outcomes of the comparisons. Henery (1992) 
employs a Thurstone model for ranking chess players 
and adapts it to three possible results: win, draw and 
loss. Bockenholt and Dillon (1997a) consider a five- 
response-categories model for applications to taste 
testing of beverages and to preferences for brands 
of cigarettes. Dittrich, Hatzinger and Katzenbeisser 
(2004) consider motives to start a Ph.D. program 
using three response categories in the log-linear ver- 
sion of the Bradley-Terry model. 

It is often of interest to investigate whether some 
covariates affect the results of the comparisons. EUer- 
meier, Mader and Daniel (2004) employ a Bradley- 
Terry model to analyze pairwise evaluations of sounds 
and include sound-related covariates, for example, 
roughness, sharpness, et cetera, to evaluate which 
of them contribute to the unpleasantness of sounds. 
Duineveld, Arents and King (2000) use the log-linear 
formulation of the Bradley-Terry model to investi- 
gate consumer preference data on orange soft drinks 
including an analysis of the factorial design for the 
drinks compared, while Francis et al. (2002) include 
subject-specific covariates in the analysis of value 
orientation of people in different European coun- 
tries. Applications of the Bradley-Terry model are 
present also in zoological data in order to investi- 
gate aspects of animal behavior considering animal- 
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specific covariates (Stuart-Fox et al. (2006); Whit- 
ing et al. (2006); Head et al. (2008)). Agresti [(2002), 
Chapter 10] extends the Bradley-Terry model to ac- 
count for the home advantage effect in baseball data. 

Sometimes it is more realistic to include depen- 
dence among observations. Object-specific random 
effects can be used to introduce correlation between 
comparisons with common objects, for example, in 
sports data (Cattelan (2009)). When all judges per- 
form all paired comparisons, random effects can in- 
troduce correlation between preferences expressed 
by the same subject involving a common object as 
shown in Bockenholt and Tsai (2007) for the univer- 
sity preference data. 

When paired comparisons are performed in pro- 
longed time periods, it may be necessary to account 
for it. McHale and Morton (2011) estimate a Bradley- 
Terry model in which tennis matches distant in time 
are down-weighted since the aim is to predict the re- 
sults of future matches. Further dynamic extensions 
for sports data have been proposed by Barry and 
Hartigan (1993), Fahrmeir and Tutz (1994), Knorr- 
Held (2000) and Cattelan, Varin and Firth (2012). 
In tournaments it may happen that a player wins 
all the comparisons in which he is involved. In this 
case a standard Bradley-Terry or Thurstone model 
would estimate an infinity worth parameter for this 
team. Mease (2003) proposes a penalization of the 
likelihood which overcomes this problem. The meth- 
od proposed by Firth (1993) to reduce the bias of 
the maximum likelihood estimates is an alternative 
technique to obtain finite estimates in this instance. 
Finally, the case in which the margin of victory in 
sport contests is not discrete, but continuous, is an- 
alyzed in Stern (2011). 

In the context of the log-linear specification of the 
Bradley-Terry model, Dittrich et al. (2012) account 
also for missing responses in a study about the qual- 
ities of a good teacher. 

2.3 Ordinal Paired Comparisons 

Sometimes subjects are requested to express a de- 
gree of preference. Suppose that objects i and j are 
compared, and the subject can express strong pref- 
erence for i over j, mild preference for i, no prefer- 
ence, mild preference for j over i or strong preference 
for j. li H denotes the number of grades of the scale, 
then in this example, H = 5. 

Let Yij = 1,. . . ,H, where 1 denotes the least fa- 
vorable response for i, and H is the most favorable 
response for i. Agresti (1992) shows how two mod- 



els for the analysis of ordinal data can be adapted 
to ordinal paired comparison data. The cumulative 
link models exploit the latent random variable rep- 
resentation. Let Zij be a continuous latent random 
variable, and let ri < r2 < • • • < th-i denote thresh- 
olds such that Yij = h when t^-i < Zij < t^. Then, 

(2.2) pr(yij < yij) = F{Ty^^ -fii + fij), 

where — oo = tq <ti < ■ ■ ■ < th~i <th = oo, and F 
is the cumulative distribution function of the latent 
variable Zij. F is usually assumed to be either the lo- 
gistic or the normal distribution function leading to 
the cumulative logit or the cumulative probit model, 
respectively. The symmetry of the model imposes 
that Th = —TH-hi h = 1, . . . , H and th/2 = when H 
is even. When H = 3 there are two threshold param- 
eters, Ti and T2, such that ri = —T2 and model (2.2) 
corresponds to the extension of the Bradley-Terry 
model introduced by Rao and Kupper (1967) when 
a logit link is considered, and the extension of the 
Thurstone model by Glenn and David (1960) when 
the probit link is employed. 

An alternative model proposed by Agresti (1992) 
is the adjacent categories model. In this case the 
link is applied to adjacent response probabilities, 
rather than cumulative probabilities and reduces to 
the Bradley-Terry model when only 2 categories 
are allowed and to the model proposed by David- 
son (1970) when 3 categories are allowed. The ad- 
jacent categories model is simpler to interpret than 
cumulative link models since the odds ratio refers to 
a given outcome instead of referring to groupings of 
outcomes (Agresti (1992)). The adjacent categories 
model, as well as the Bradley-Terry model, has also 
a log-linear representation (Dittrich, Hatzinger and 
Katzenbeisser (2004)). 

An application of the adjacent categories model to 
market data is illustrated in Bockenholt and Dillon 
(1997b). Bockenholt and Dillon (1997a) note that 
a bias may be caused by the usage of the scale be- 
cause subjects may use only subsets of all categories. 
The threshold parameters Th can account for the se- 
lection bias, for example, in the cumulative probit 
model the quantity <I>(t/i) — $(t/i_i) gives the cate- 
gory selection bias since it is the probability of se- 
lecting category h when the two stimuli are equal. 
Different latent classes of consumers with different 
threshold values and worth parameters can be iden- 
tified. If subjects share the same worth parameters 
but have different thresholds, it is possible to let 
thresholds depend on subject-specific covariates and 
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to have a random part (Bockenholt (2001b)). It is 
also possible to define thresholds that depend on the 
objects compared, as in Henery (1992). 

Example. In the paired comparisons of univer- 
sities, students were allowed to express no preference 
between two universities. Therefore, the data should 
be analyzed by means of a model for ordinal data. 
Columns 5-7 in Table 2 show the estimates of a cu- 
mulative probit extension of the Thurstone model 
for the university data. The estimated threshold pa- 
rameter f2 = 0.153 is highly significant. In this par- 
ticular case, the estimates of the worth parameters 
and their standard errors are very similar to those 
of the model with two categories, and the ranking of 
universities remains the same, but in general, espe- 
cially when the number of no preferences is large, re- 
sults can be different. Moreover, in this case it is pos- 
sible to estimate the probability of no preference be- 
tween London and Paris which is $(0,153 — 0.998 + 
0.566) - ^>(-0.153 - 0.998 + 0.566) = 0.11, and the 
estimated probability that London is preferred to 
Paris reduces to 1 - ^>(0.153 - 0.998 + 0.566) = 0.61; 
hence the estimated probability that Paris is pre- 
ferred to London is 0.28. There is no much differ- 
ence from the previous result in the test of equality 
of worth parameters for universities in St. Gallen 
and Milan. 

2.4 Explanatory Variables 

In many instances, it is of interest to investigate 
whether some explanatory variables affect the re- 
sults of the comparisons. Explanatory variables can 
be related to the objects compared, to the subjects 
performing the comparisons or they can be compar- 
ison-specific. 

Let Xj = (xji, . . . ,Xip)' be a vector of P explana- 
tory variables related to object i and /3 = (/3i , . . . , /3p) 
be a P-dimensional parameter vector. Then, in the 
context of the Bradley-Terry model, Springall (1973) 
proposes to describe the worth parameters as the 
linear combination 

(2.3) m = Xiil3i^ hxip/3p, i = l,...,n. 

A paired comparison model with explanatory vari- 
ables is called a structured model. The same exten- 
sion can be applied to the Thurstone model. Note 
that since only the differences //j — Hj = (xj — Xj)'/3 
enter the linear predictor, an intercept cannot be 
identified. In some instances, both worth parame- 
ters of objects and further object-specific covariates 
are included, hence the linear predictor assumes the 
form — /ij + (xj — Xj)'/3; see Stern (2011). 



Model (2.3) has been extended to more fiexible 
models, such as additive combinations of spline 
smoothers (De Soete and Winsberg (1993)); how- 
ever large data sets may be necessary to estimate 
nonlinearities reliably, even though there is no in- 
vestigation about this issue. 

In case worth parameters are specified as in (2.3), 
standard errors for the worth parameters can be 
computed through the delta method, while when 
both the worth parameters and covariates are in- 
cluded in the linear predictor, quasi-standard errors 
can be computed for the worth parameters. 

The results of the comparisons can be influenced 
also by characteristics of the subject that performs 
the paired comparisons. In the log-linear representa- 
tion of the Bradley-Terry model, Dittrich, Hatzinger 
and Katzenbeisser (1998) show how to include cate- 
gorical subject specific covariates, while Francis et al. 
(2002) tackle the problem of continuous subject- 
specific covariates and consider also the case in which 
some of these covariates have a smooth nonlinear re- 
lationship. 

Dillon, Kumar and De Borrero (1993) consider 
a marketing application and divide subjects in la- 
tent classes to which they belong with a probability 
that depends on their explanatory variables. 

Covariates can be added in the linear predictor 
(2.3) if they are subject-object interaction effects. 
For example, the knowledge of a foreign language 
may influence the preference for a university. An in- 
teraction effect can account for whether the student 
knows, for example, Spanish and one object in the 
comparison is the university in Barcelona. Unfortu- 
nately, subject covariates that do not interact with 
objects, such as age of respondents, cannot be in- 
cluded. 

A semiparametric approach which accounts for 
subject-speciflc covariates is proposed by Strobl, 
Wickelmaier and Zeileis (2011) who suggest a meth- 
odology to partition recursively the subjects that 
perform the paired comparisons on the basis of their 
covariates. The procedure tests whether structural 
changes in the parameters occur for subjects with 
different values of the covariates. Subjects are split 
according to the test and a different unstructured 
Bradley-Terry model is fltted for each subgroup. 
The method allows us to identify which covariates 
influence the worth parameters without the need to 
assume a model for them and flnds the best cut 
point in case of continuous covariates. Moreover, it 
is possible to include subject-specific covariates, not 
only interaction effects. Attention is needed in set- 
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ting the minimum number of subjects per class and 
in setting the significance level of the test in order 
to avoid overfitting for large data sets. Differently 
from the usual latent class models, the method al- 
lows to divide subjects on the basis of their covari- 
ates; however, if some important subject-specific co- 
variates are not available, it may be expected that 
the usual latent class model will perform better. In 
Strobl, Wickelmaier and Zeileis (2011) an unstruc- 
tured Bradley-Terry model is estimated for each 
subgroup, but it seems possible to extend the method 
also to structured models. 

Finally, there may be also comparison-specific co- 
variates which are related to the objects, but change 
from comparison to comparison. An example of 
a comparison-specific covariate is the home advan- 
tage effect in sport tournaments since it depends on 
whether one of the players competes in the home 
field. This effect may be accounted for by adding 
a further term in the linear predictor (2.3). An- 
other example is the experience effect in contests 
between animals which, in Stuart-Fox et al. (2006), 
is accounted for through a covariate that counts the 
number of previous contests fought by animals. 

Example. In the universities paired compari- 
sons, it may be of interest to assess whether some 
object-specific covariates influence the results of the 
comparisons. The universities in London and Milan 
specialize in economics, the universities in Paris and 
Barcellona specialize in management science and the 
remaining two in finance. This aspect may influence 
the decisions of students. Another element that may 
affect the comparisons is the location of the univer- 
sities, in this respect they can be divided in univer- 
sities in Latin countries (Italy, France and Spain) 
and universities in other countries. 

Some features of the students that performed the 
universities paired comparisons were collected, too. 
In particular, it is known whether students have 
good knowledge of English, Italian, Spanish and 
French and which is the main topic of their studies. 
It is conceivable that, for example, students with 
a good knowledge of French are more inclined to 
prefer the university in Paris. Table 3 shows the es- 
timates of a model with a linear predictor that in- 
cludes object specific covariates and subject-object 
interaction effects. Universities in non-Latin coun- 
tries are preferred to those in Latin countries, and 
universities that specialize in finance seem less ap- 
pealing to students. The good knowledge of a foreign 
language induces students to choose the university 



Table 3 

Estimates (Est.) and standard errors (S.E.) of universities 
data with subject- and object- specific covariates 





Est. 


S.E. 


Economics 


0.757 


0.066 


Management 


0.789 


0.080 


Latin country 


-0.835 


0.071 


Discipline:Management 


0.238 


0.054 


English: London 


0.141 


0.075 


French:Paris 


0.652 


0.049 


Italian:MiIan 


1.004 


0.094 


SpanishiBarcelona 


0.831 


0.095 


T2 


0.160 


0.007 



situated in the country where that foreign language 
is spoken. Consider a student with a good knowledge 
of both English and French and whose main disci- 
pline of study is management, then the estimated 
probability that this student prefers London to Paris 
is 1 - «>{0.160 - (0.141 + 0.757 - 0.652 - 0.789 + 
0.835 - 0.238)} = 0.46, while the estimated proba- 
bilities of no preference and preference for Paris are 
0.13 and 0.41, respectively. If this student's main 
discipline of study was not management, which is 
the subject in which Paris specializes, then the above 
estimated probabilities of preferring London, no pref- 
erence and preferring Paris would become 0.55, 0.12 
and 0.33, respectively. 

3. MODELS FOR DEPENDENT DATA 

3.1 Intransitive Preferences 

The models presented so far are estimated assum- 
ing independence among all observations. The inclu- 
sion of a dependence structure is not only more real- 
istic, but also has an impact on the transitivity prop- 
erties of the model. Intransitive choices occur when 
object i is preferred to j, and object j is preferred 
to A;, but in the paired comparison between i and fc, 
the latter is preferred. These are also called circular 
triads. Paired comparison models can present dif- 
ferent transitivity properties. Assume that vTjj > 0.5 
and TTjfc > 0.5, then a model satisfies: 

• weak stochastic transitivity if vrj^ > 0.5; 

• moderate stochastic transitivity if vrj^ > min(7rjj , TTjk)] 

• strong stochastic transitivity if TTjfc > max(7rjj, VTjfc). 

The Bradley-Terry and Thurstone models as pre- 
sented so far satisfy strong stochastic transitivity. 
This property may be desirable sometimes, for ex- 
ample, when asking wiring experts which is the risk- 
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ier situation between different scenarios in an air- 
craft environment. In this case it is desirable that 
choices are consistent, so Mazzucchi, Linzey and 
Bruning (2008) use transitivity to check the level 
of reliability of experts. However, in some situations 
choices can be systematically intransitive, for exam- 
ple, when the same objects have more than one as- 
pect of interest, and different aspects prevail in dif- 
ferent comparisons. 

Causeur and Husson (2005) propose a two-dimen- 
sional Bradley-Terry model in which the worth pa- 
rameter of each object is bidimensional and can thus 
be represented on a plane. A further multidimen- 
sional extension is proposed by Usami (2010). How- 
ever, this methodology does not provide a final rank- 
ing of all objects. 

A different method that allows the inclusion in the 
model even of systematic intransitive comparisons 
while yielding a ranking of all the objects consists 
of modeling the dependence structure among com- 
parisons. The development of inferential techniques 
for dependent data has recently allowed an investi- 
gation of models for dependent observations. 

3.2 Multiple Judgment Sampling 

The assumption of independence is questioned in 
the case of the multiple judgment sampling, that is, 
when S people make all the paired comparisons. 
It seems more realistic to assume that the compar- 
isons made by the same person are dependent. This 
aspect has received much attention in the literature 
during the last decade. 

3.2.1 Thurstonian models The original model 
proposed by Thurstone (1927) includes correlation 
among the observations. The model was developed 
for analyzing sensorial discrimination and assumes 
that the stimuli T = (Ti , . . . , r„)' compared in a pair- 
ed comparison experiment follow a normal distri- 
bution, T ^ A^(/x, St), with mean /x = (/ii, . . . , /x„)' 
and variance S^. Thurstone (1927) proposes differ- 
ent models with different covariance matrices of the 
stimuli, so the set of models which assume a normal 
distribution of the stimuli are called Thurstonian 
models. The single realization ti of the stimulus Tj 
can vary, and the result of the paired comparison 
between the same two stimuli can be different in 
different occasions. Assume that only either a pref- 
erence for i or a preference for j can be expressed, so 
then in a paired comparison when Tj > Tj object i is 
preferred, or alternatively, when the latent random 
variable Zy = Ti — Tj is positive, a win for i is ob- 
served; otherwise a win for j occurs. In the context 



of multiple judgment sampling, Takane (1989) pro- 
poses to include a vector of pair specific errors. Let 
Tig = {Zsi2, ■ ■ ■ , Zsn-in)' be the vector of all latent 
continuous random variables pertaining to subject s, 
then 

(3.1) Z, = AT + e,, 

where = (6^12, e^is, • • • , es„,_in)' is the vector of 
pair-specific errors which has zero mean, covariance 
fi and is independent of T and of e^/ for any other 
subject s' ^ s, and A is the design matrix of paired 
comparisons whose rows identify the paired compar- 
isons and columns correspond to the objects. For 
example, if re = 4, and the paired comparisons are 

(1.2) , (1,3), (1,4), (2, 3), (2, 4) and (3,4), then 



A = 



/I 


-1 







1 





-1 





1 








-1 





1 


-1 








1 





-1 


Vo 





1 


-1/ 



A similar model is employed by Bockenholt and Tsai 
(2001), who assume that Ss ~ A^(0, w^Iat). The more 
general analysis of covariance structure proposed by 
Takane (1989) can accommodate both the wander- 
ing vector and the wandering ideal point models (Car- 
roll and De Soete (1991)), which are models with dif- 
ferent assumptions about the mechanism originating 
the data. The wandering vector and wandering ideal 
point models do not impose the number of dimen- 
sions which is determined from the data alone, so 
they are powerful models to analyze human choice 
behavior and inferring perceptual dimensions. 

The model thus specified is over-parametrized. To 
reduce the number of parameters, Thurstone (1927) 
proposes different restrictions on the covariance ma- 
trix Sy, while Takane (1989) proposes a factor model. 
Nonetheless, these models with a reduced number of 
parameters need further identification restrictions; 
see Section 3.2.2. 

A further extension of model (3.1) is proposed 
by Tsai and Bockenholt (2008) who unify Tsai and 
Bockenholt (2006) with Takane (1989) to obtain 
a general class of models that can account simul- 
taneously for transitive choice behavior and system- 
atic deviations from it. In this case the latent vari- 
able is 

(3.2) Z, = AT + BV,, 

where = (^,1(2), "l^.i(3)> • • • , ^s2(i)> ^2(3), • • • > 

Vsn{n-i)y is a vector of zero mean random effects 
designed so as to capture the random variation in 
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judging an object when compared to another specific 
object, and B is a matrix with rows corresponding 
to the paired comparisons and columns correspond- 
ing to the elements of V^, so, for example, if n = 3, 

= {Vsi(2) , ^sl(3) ; ^2(1) ) ^s2(3) ) ^s3(l) ) ^s3(2) )' and 

/l -1 \ 
B= 1 0-1 . 
\0 1 -1/ 

It is assumed that V^, the within-judge variability, 
is normally distributed with mean and covariance 
'Sv so that Zs~iV(A/x,AS;TA'-|-BSyB'). 

In the remaining it will be assumed that there 
are only two possible outcomes of the comparisons, 
but it is easy to extend this model for ordinal data 
through the introduction of threshold parameters 
with a specification analogous to (2.2). 

3.2.2 Identification Psychometricians are interest- 
ed in understanding the relations between stimuli; 
hence they are primarily interested in the unstruc- 
tured and unrestricted Thurstonian models. Unfor- 
tunately, due to the comparative nature of the data, 
some identification restrictions on the covariance ma- 
trix are needed. The necessary identification restric- 
tions to estimate model (3.1) are discussed in May- 
deu-Olivares (2001, 2003), Tsai and Bockenholt 
(2002) and Tsai (2003). Consider the covariance ma- 
trix = Cov(Zs) = AStA' -I- ri, where St is an 
unrestricted covariance matrix. Because of the dif- 
ference structure of the judgments S-p and + 
dl' -|- Id' where 1 is a vector of n ones and d is 
an ra-dimensional vector of constants such that the 
matrix remains positive definite, are not distinguish- 
able (Tsai (2000)). Indeed, let K= [I„_i|-1] be an 
identity matrix of dimension n — 1 to which a column 
of elements equal to —1 is added, then only K/i and 
KStK' are identifiable. For example the matrices 

/I 0\ 

Sr,i = 1 , 

\0 ij 

/0.750 0.125 \ 
St,2 = 0.125 1.5 0.375 

\ 0.375 1.250 / 

are not distinguishable because K5]t,iK' = K • 
'St,2^' = ii 2)' where the second matrix is obtain- 
ed from the first one by setting d = (—1/8, 1/4, 1/8). 
This consideration remains valid for any generic ma- 
trix of contrasts that may be used instead of K. The 
specifications of the covariance matrix with a re- 
duced number of parameters proposed by Thurstone 



(1927) cannot be recovered from the data and only 
covariance classes can be considered. 

Tsai (2003) shows that n + 2 constraints are needed 
in order to identify model (3.1), including the con- 
straint on the worth parameters. As for the mean pa- 
rameters, many different constraints can be imposed 
on the covariance matrix. For example, Bockenholt 
and Tsai (2001), Tsai and Bockenholt (2002) and 
Maydeu-Olivares (2003) set all the diagonal elements 
of St equal to 1 and either one of the diagonal 
elements of fi to 1 or one of the nondiagonal ele- 
ments of St equal to zero. However, if St is fixed 
to be a correlation matrix, the set of matrixes that 
produce the same sets of probabilities is limited. 
Maydeu-Olivares and Bockenholt (2005) set all the 
covariances involving the last latent utility to zero, 
which corresponds to assuming independence be- 
tween the last stimulus and the others, and the vari- 
ance of the first and last item to one. Maydeu-Oliva- 
res (2007) suggest to set all diagonal elements of St 
equal to one and the sum of the correlations be- 
tween the first and the other latent variables to one. 
With these constraints positive entries in the corre- 
lation matrix imply that strong preference for one 
stimulus is associated with strong preference for the 
other stimulus, while negative entries indicate that 
strong preference for one stimulus is associated with 
weak preference for the other stimulus. Thus, it is 
not necessary to fix any element in the matrix ft, 
since the constraint u = 1 inQ = uj'^In could lead to 
a nonpositive definite matrix St. After estimation 
it is possible to recover the class of covariance ma- 
trixes that produce the same probabilities (Maydeu- 
Olivares and Hernandez (2007)). However, the ini- 
tial identification constraints pose limits on the set 
of covariance matrixes that identify the same model. 

There is no discussion or results about the identifi- 
cation restrictions necessary to estimate model (3.2). 
In order not to incur identification problems, Tsai 
and Bockenholt (2008) assume that the matrix Sy 
depends on very few parameters. 

3.2.3 Models with logit link The dependence be- 
tween evaluations made by the same judge has been 
introduced also in models employing logit link func- 
tions. Different specifications have been used for this 
purpose. 

A first inclusion of dependence in logit models is 
proposed by Lancaster and Quade (1983), who con- 
sider multiple judgments by the same person and 
introduce correlation in the Bradley-Terry model 
assuming that the worth parameters are random 
variables following a beta distribution with shape 
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parameters Uij and bij. The Bradley-Terry model 
is imposed on the means of the beta distributions, 
that is, E{7Tij) = aij/{aij + bij) = 7rj/(7rj + ttj), but 
such a model introduces correlation only between 
comparisons of the same judge on the same pair 
of objects, while the other comparisons remain in- 
dependent. The same limit presents the extension 
by Matthews and Morris (1995) who consider three 
possible response categories. 

Two different methods have been used for intro- 
ducing dependence among comparisons made by the 
same person involving one common object in logit 
models. The first method exploits the usual associa- 
tion measure for binary data: the odds ratio. Bocken- 
holt and Dillon (1997a) consider the adjacent cate- 
gories model for preference data with H categories 
and suggest a parametrisation in terms of log-odds 
ratios to account for dependence between observa- 
tions, while Dittrich, Hatzinger and Katzenbeisser 
(2002) adopt a similar approach in a two-categorical 
model using the log-linear formulation of the Brad- 
ley-Terry model. This specification is convenient be- 
cause it allows one to estimate the model through 
standard software developed for log-linear models, 
but the number of added parameters can be quite 
large (Dittrich, Hatzinger and Katzenbeisser (2002)). 

Another method used for introducing dependence 
among observations is the inclusion of random ef- 
fects in the linear predictor. Bockenholt (2001a) de- 
scribes the worth of object i for subject s as 

p 

p=l 

where Ugi is a random component, and Xj is a vec- 
tor of P subject-specific (and possibly item specific) 
covariates. Bockenholt (2001a) employs a logit link 
function and assumes that Us = {Ugi, ■ ■ ■ , Ugn)' fol- 
lows a multivariate normal distribution with mean 
and covariance S(/. 

Francis, Dittrich and Hatzinger (2010) consider 
the log-linear representation of the Bradley-Terry 
model and introduce random effects for each respon- 
dent in order to account for residual heterogeneity 
that is not included in subject-specific covariates. 
The inclusion of random effects in the linear pre- 
dictor introduces difficulties in the estimation of the 
model. 

3.2.4 Choice models The work by Thurstone has 
great importance in the development of models for 
analyzing discrete choices, not only from a psycho- 



metric point of view, but also in economic choice 
theory. When the idea that choices may be ran- 
dom and not fixed started to develop, the use of 
the model proposed by Thurstone was suggested 
(Marschak (I960)). As the Nobel laureate McFad- 
den (2001) states, "when the perceived stimuli are 
interpreted as levels of satisfaction, or utility, this 
can be interpreted as a model for economic choice." 

According to the economic theory, models for dis- 
crete choice are required to satisfy the utility maxi- 
mization assumption which states that subjects max- 
imize their utility when making decisions. Let T^j 
denote the utility of subject s from alternative i 
which can be decomposed as T^j = M^j + £si, where 
Msi denotes a function which relates a set of al- 
ternative attributes and subject attributes to the 
utility gain and Sgi denotes factors that affect util- 
ity, but are not included in Mgi- The probability 
that subject s chooses alternative i is equal to the 
probability that the utility gained from i is higher 
than the utility from every other object in the choice 
set: pr(Tsj > Tsj,\/ij^j) = pT{esi-esj < Mgi-M^j, 
j). These models are called random utility mod- 
els. For each person, a choice is described as n — 1 
paired comparisons between the preferred alterna- 
tive and all other options. Note that paired com- 
parisons do not really occur, so inconsistent choices 
cannot be observed. 

From the above specification, different models have 
been developed depending on the assumptions about 
the distribution of the errors and the formulation 
of the mean term M^j. If the e^j's are independent 
and follow a Gumbel distribution the choice model is 
a logit model and, when Mgi = ^'stf^i it corresponds 
to the structured Bradley-Terry model. A particu- 
lar concern is caused by the independence from ir- 
relevant alternatives (Luce (1959)) property which 
characterizes the Bradley-Terry model. Indeed, in 
the Bradley-Terry model the ratio between proba- 
bilities of choosing one option over another is inde- 
pendent from the other available alternatives. Often, 
this property is not satisfied in real data. This limit 
is somehow overcome by assuming a type of gen- 
eralized extreme value distribution for the errors. 
In the resulting nested logistic model, independence 
from irrelevant alternatives holds for sets of alterna- 
tives within a same subset and not for alternatives 
in different subsets (Train (2009)). The advantage of 
these specifications is that models can be estimated 
easily, but they cannot account for random taste 
variation or unobserved factors correlated over time. 
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A further proposal is to assume a multivariate nor- 
mal distribution for the errors Egi- This model is 
very flexible since it allows for random taste varia- 
tion and, when necessary, for temporally correlated 
errors, but its estimation is not straightforward. The 
resulting model is a multivariate probit model, like 
the Thurstone model. In economic choice models 
it is of interest to consider the influence on deci- 
sions of covariates that are included in the mean 
term Mgi- Explanatory variables can be considered 
also in psychometric models (Tsai and Bockenholt 
(2002)), even though interest is focused on the pa- 
rameters fi which are always included in the linear 
predictor. 

Other extensions include further random elements 
in the mean term M^j, so as to allow flexible distur- 
bances or to account for different attitudes and per- 
ceptions of different people. All these elements add 
difficulties in the estimation of the model. 

An important aspect in choice theory is the dis- 
tinction between stated and revealed preferences. 
This problem has not received much attention in 
the psychometric literature, but there may be differ- 
ences between what people say they would choose in 
a questionnaire survey and what they really choose. 
The former are called stated preferences and the 
latter revealed preferences. If both types of prefer- 
ences are available, it may be useful to analyze them 
all together. Walker and Ben-Akiva (2002) propose 
a model that incorporates many of the above ex- 
tensions; however, care is needed when specifying 
the model because it may be difficult to understand 
which parameters can be identified. Moreover, the 
inclusion of additional disturbances and unobserved 
covariates requires the approximation of integrals 
whose dimension can be high. 

Random utility models are very useful and widely 
spread; however, some doubts have been raised about 
their basic assumption that people act as to maxi- 
mize their utility since sometimes consumers do not 
make rational choices (Bockenholt (2006)). 

3.3 Object- Related Dependencies 

In the multiple judgment sampling the dependence 
among observations derives from repeated compar- 
isons made by the same person, usually involving 
a common object. In case paired comparisons are 
not performed by a judge, the correlation may arise 
from the fact that the same object is involved in mul- 
tiple paired comparisons. For example, when con- 
tests among animals are analyzed, it is realistic to 
assume that comparisons involving the same animal 



are correlated. In this perspective. Firth (2005) sug- 
gests to set 

(3.3) //i=x^/3 + C/„ 

where Ui is a zero mean object-specific random ef- 
fect. This approach is investigated in Cattelan (2009). 
The results of comparisons are related to observed 
characteristics of the animal and to unobserved quan- 
tities that are captured by the random effect Ui. 

In this case, the latent random variable can be 
written as 

Z = AX/3 + AU + r?, 

where U = {Ui, . . . , C/„) is the vector of all object- 
specific random effects, X is the matrix of covari- 
ates with columns Xj, rj are independent normally 
distributed errors with mean and variance 1 while 
the matrix A is the design matrix of the paired 
comparisons with rows that describe which compar- 
isons are observed, not necessarily all possible paired 
comparisons. If it is assumed that U is multivari- 
ate normal with mean and covariance In<7^, then 
Z ~ A^(AX/3,cr^AA-^ + Irf), where d is the number 
of paired comparisons observed. Again, this model 
is a multivariate probit model. However, this type 
of data presents some different features with respect 
to multiple judgment sampling. While in pshycho- 
metric applications n is not very large because it 
is unlikely that a person will make all the paired 
comparisons when n > 10, this will typically hap- 
pen in sport tournaments or in paired comparison 
data about animal behavior. Moreover, in the multi- 
ple judgment sampling scheme S independent repli- 
cations of all the comparisons are available, but in 
other contexts this does not occur, adding further 
difficulties. 

3.4 Inference 

3.4.1 Estimation In this section, the multiple judg- 
ment sampling scheme is mainly investigated, and 
only some comments are made about the case of 
object-related dependencies. There are different meth- 
ods for estimating models for dependent paired com- 
parison data. A first approach to the computation of 
the likelihood function requires to integrate out the 
latent variables T from the joint distribution of Y 
and T. This integral has dimension n, the number of 
items, but rewriting it in terms of differences Ti — Tn, 
i = l,...,n — 1, the dimension can be reduced to 
n — 1, which nonetheless may still be quite large 
when methods such as the Gauss-Hermite quadra- 
ture are employed. 
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Alternatively, it is possible to represent the joint 
distribution of the observations as a multivariate 
probit model. Let Z* = D(Zs — A/i) be the stan- 
dardized version of the latent variable Z^, where 
D = [diag(S;z)]~^/^ and denotes the covariance 
matrix of Z^ expressed as in model (3.1) or in model 
(3.2). Then, Z* follows a multivariate normal distri- 
bution with mean and correlation matrix S^* = 
DI^^D. Object i is preferred to object j when z*^j > 
Tj*-, where the vector of the thresholds is given by 
T* = — DA/x. The likelihood function is the product 
of the probability of the observations for each judge 

s 

£(^;Y) = J] £,(./.; Y,), 

s=l 

where 

Cs{ip;Y,)= f ■■ I 0;v(z:;Sz*)dz:, 

J Rsl2 J Rs?i-l?i 

(Pn{'','^z*) denotes the density function of an A^- 
dimensional normal random variable with mean 
and correlation matrix 'Sz* and 

_r(-oo,r*.) ify,ij = l, 
^-^■-\(r*.,oo) ify.i,=2. 

Note that this approach requires the approxima- 
tion of S integrals whose dimension is equal to A = 
n{n— l)/2, the number of paired comparisons, so its 
growth is quadratic with the increase in the number 
of objects. However, there is a large literature about 
methods for approximate inference in multivariate 
probit models. The algorithm proposed by Genz and 
Bretz (2002) to approximate multivariate normal 
probabilities is based on quasi-Monte Carlo meth- 
ods, and Craig (2008) warns against the randomness 
of this method for likelihood evaluation. A determin- 
istic approximation is developed by Miwa, Hayter 
and Kuriki (2003), but it is available only for in- 
tegrals of dimension up to 20 since even for such 
a dimension its computation is very slow. Approxi- 
mations based on Monte Carlo methods can be used 
(Chib and Greenberg (1998)), but they may be com- 
putationally expensive if the dimension of the inte- 
gral is very large. Bockenholt and Tsai (2001) use an 
EM algorithm, while in econometric theory a maxi- 
mum simulated likelihood approach in which multi- 
variate normal probabilities are simulated through 
the Geweke-Hajivassiliou-Keane algorithm is em- 
ployed (Train (2009)). A further approach may be 
based on data cloning (Lele, Nadeem and Schmu- 
land (2010)). When integrals are very large, and the 
approximation is computationally demanding and 



time-consuming, it is possible to resort to limited in- 
formation estimation methods, which are estimation 
procedures based on low dimensional margins. Here, 
we compare two different methods. The first one is 
widely applied in the context of multiple judgment 
sampling (Maydeu-Olivares, 2001, 2002; Maydeu- 
Olivares and Bockenholt 2005) and will be called 
limited information estimation; the second is pro- 
posed in the context of object-specific dependencies 
in Cattelan (2009) and is called pairwise likelihood. 

The limited information estimation procedure con- 
sidered here consists of three stages. In the first stage 
the threshold parameters r* are estimated exploit- 
ing the empirical univariate proportions of wins. In 
the second stage the elements of ^z*, which are 
tetrachoric correlations, are estimated employing the 
bivariate proportions of wins. Finally, in the third 
stage the model parameters are estimated by min- 
imizing the function 

(3.4) G = {k-k(-0)}'W{k-k(i/;)}, 

where k denotes the thresholds, and tetrachoric cor- 
relations, estimated in the first and second stages, 
denotes the thresholds, and tetrachoric corre- 
lations under the restrictions imposed on those pa- 
rameters by the model parameters ip and W is a non- 
negative definite matrix. Let H denote the asymp- 
totic covariance matrix of k. Then it is possible 
to use W = H"^ (Muthen (1978)), W = [diag(3)]^i 
(Muthen, Du Toit and Spisic (1997)) or W = I 
(Muthen (1993)). The last two options seem more 
stable in data sets with a small number of objects 
(Maydeu-Olivares (2001)). This method is very fast, 
and Maydeu-Olivares (2001) states that it may have 
an edge over full information methods because it 
uses only the one and two-dimensional marginals of 
a large and sparse contingency table. 

Pairwise likelihood (Le Cessie and Van Houwelin- 
gen (1994)) is a special case of the broader class of 
composite likelihoods (Lindsay (1988); Varin, Reid 
and Firth (2011)). The pairwise likelihood of all the 
observations is the product of the pairwise likeli- 
hoods relative to the single judges £pair('0;Y) = 
nil ^^air(^;Ys), where 

^pair(V';Y,) 

n—2 n—1 n—1 n 

=n n n n p'^i^sij=ysij,yski=yski)- 

1=1 j=i+l k=i l=j+l 

Let il^iM;Ys) = log£;,i,(^; Y,) denote the log- 
arithm of the pairwise likelihood for subject s and 
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£pair(V'; Y) = Ef=i^;air(V^; Ys) be the whole pair- 
wise log-likelihood. Under usual regularity condi- 
tions on the log-likelihood of univariate and bivari- 
ate margins, the maximum pairwise likelihood esti- 
mator is consistent and asymptotically normally dis- 
tributed with mean if) and covariance matrix 
H(t/;)-lJ(V^)H(V')-^ where J(t/») = var{V£pair(V^; 
Y)} and H(V') =-E{-V2£pair('0;Y)} (Molenberghs 
and Verbeke (2005); Varin, Reid and Firth (2011)). 
Unfortunately, the analogous of the likelihood ratio 
test based on pairwise likelihood does not follow the 
usual chi-square distribution (Kent (1982)). In the 
multiple judgment sampling context, it is natural 
to consider asymptotic properties of pairwise like- 
lihood estimators computed as the number of sub- 
jects increases, that is, as 5 — )■ oo. When the number 
of paired comparisons per subject is bounded, the 
above properties are satisfied (Zhao and Joe (2005)). 
Pairwise likelihood reduces noticeably the computa- 
tional effort since it requires only the computation 
of bivariate normal probabilities. The standard er- 
rors can be computed straightforwardly by exploit- 
ing the independence between the observations of 
different judges. In fact, H('j/>) can be estimated 
by the Hessian matrix computed at the maximum 
pairwise likelihood estimate, while the cross-product 
Ef=i ^^Uiri^; ^s)ViUA^; Y,)' can be used to es- 
timate J(i/'). 

The case of object-related dependencies is not con- 
sidered in the following simulation study; however, 
note that some different difficulties arise. As already 
pointed out, in this context there is a large n and 
small S, so the limited information estimation meth- 
od cannot be applied, but pairwise likelihood can 



still be employed (Cattelan (2009)). However, it is 
more problematic to consider the asymptotic behav- 
ior of the maximum pairwise likelihood estimator 
when data are a long sequence of dependent observa- 
tions; see, for example. Cox and Reid (2004). In the 
context of paired comparison data, results of sim- 
ulations for increasing n when all possible paired 
comparisons are performed are encouraging (Cat- 
telan (2009)); however, theoretical results for this 
instance are still lacking. 

3.4.2 Simulation studies Simulation studies were 
performed considering models (3.1) and (3.2). It is 
assumed that n = 4; hence also a full likelihood ap- 
proach based on the algorithm by Miwa, Hayter and 
Kuriki (2003) can be used since the integral has di- 
mension 6. 

The first simulation setting is the same as that 
proposed in Maydeu-Olivares (2001), where the mod- 
el Zs = AT + is assumed with 

0.5 X /I 

_ 0.8 1 

-0.5 I ' ^~ 0.7 0.6 1 

/ \0.8 0.7 0.6 1, 

and the covariance matrix of e is n = w^Ig. For iden- 
tification purposes the diagonal elements of Sr are 
set equal to 1, /i4 = and = 1. Hence, in this 
case St is actually a correlation matrix. Table 4 
shows the mean and medians of the simulated esti- 
mates on 1000 data sets assuming S = 100 judges. 
Moreover, the average of model-based standard er- 
rors and the simulation standard deviations are re- 
ported. In limited information estimation, the ma- 
trix W = I is employed. In this setting all the meth- 
ods seem to perform comparably well. Table 5 shows 



V 



Table 4 

Average (Mn) and median (Md) simulated estimates, average model-based standard errors (s.e.) and 
simulation standard deviations (s.d.) of parameters estimated by maximum likelihood (ML), 
limited information estimation (LI) and pairwise likelihood (PL) 





True 
value 




ML 






LI 








PL 








Mn 


s.e. 


s.d. 


Mn 


Md 


s.e. 


s.d. 


Mn 


Md 


s.e. 


s.d. 


Ml 


0.5 


0.51 


0.13 


0.13 


0.51 


0.50 


0.13 


0.13 


0.50 


0.50 


0.13 


0.13 


M2 





0.01 


0.12 


0.13 


0.01 


0.01 


0.12 


0.13 


0.01 


0.01 


0.12 


0.13 


M3 


-0.5 


-0.49 


0.15 


0.15 


-0.50 


-0.48 


0.15 


0.15 


-0.49 


-0.48 


0.15 


0.15 


0"12 


0.8 


0.80 


0.12 


0.14 


0.78 


0.80 


0.13 


0.14 


0.79 


0.80 


0.13 


0.15 


0"13 


0.7 


0.70 


0.17 


0.17 


0.69 


0.71 


0.17 


0.17 


0.69 


0.71 


0.18 


0.18 


an 


0.8 


0.79 


0.13 


0.14 


0.78 


0.79 


0.13 


0.14 


0.78 


0.80 


0.14 


0.15 


0"23 


0.6 


0.58 


0.19 


0.20 


0.57 


0.60 


0.19 


0.20 


0.57 


0.60 


0.19 


0.20 


0"24 


0.7 


0.68 


0.16 


0.16 


0.66 


0.67 


0.16 


0.17 


0.67 


0.68 


0.16 


0.17 


0"34 


0.6 


0.58 


0.21 


0.20 


0.57 


0.60 


0.20 


0.20 


0.57 


0.60 


0.20 


0.20 
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Table 5 

Empirical coverage of confidence intervals for model 
parameters of limited information estimator (LI) 
and pairwise likelihood estimator (PL) at 
nominal levels 95%, 97.5% and 99% 



0.950 0.975 0.990 





LI 


PL 


LI 


PL 


LI 


PL 


Ml 


0.947 


0.958 


0.982 


0.978 


0.992 


0.992 


fJ-2 


0.960 


0.964 


0.978 


0.976 


0.988 


0.988 




0.941 


0.930 


0.969 


0.972 


0.995 


0.991 


0"12 


0.959 


0.985 


0.975 


0.997 


0.989 


1.000 


Cri3 


0.934 


0.939 


0.961 


0.967 


0.968 


0.985 


(Ti4 


0.941 


0.968 


0.967 


0.996 


0.988 


1.000 


0"23 


0.965 


0.970 


0.973 


0.980 


0.987 


0.995 


0"24 


0.943 


0.933 


0.951 


0.959 


0.967 


0.973 


0"34 


0.953 


0.946 


0.969 


0.966 


0.977 


0.989 



the empirical coverages of confidence intervals based 
on the normal approximation. 

The second simulation setting considers model (3.2) 
proposed by Tsai and Bockenholt (2008). Here, we 
consider differences with a reference object, so we 
compute means and variances of the differences Tj = 
Tj — Tn for i = 1, . . . ,n — 1. The assumed worth pa- 
rameters of these differences are /t = (—0.2, 1, —1.5) 
while the covariance matrix is 

/1.5 1 1.3\ 

1 4 2.5 , 
\1.3 2.5 3 / 

and (Tjj is used to denote the element in row i and 
column j of the above reduced matrix. Differently 
from the previous setting, this specification of the 



model allows one to estimate also the variance of the 
differences Tj — T„ and to check whether they are dif- 
ferent for the various objects. Tsai and Bockenholt 
(2008) propose a specification of the matrix B which 
depends only on one parameter b whose value is set 
equal to 0.5. 

Table 6 presents the results of the simulations. 
Maximum likelihood based on numerical integration 
is the method that performs best; however, maxi- 
mization of the likelihood was not always straight- 
forward, and sometimes the optimization algorithms 
employed stopped at a point where the Hessian ma- 
trix was not negative definite. 

Pairwise likelihood estimation seems to perform 
quite well, especially if compared to limited infor- 
mation estimation, which seems not satisfactory in 
this case with S = 100, as already noticed in Tsai 
and Bockenholt (2008). Estimating the parameters 
of the covariance matrix appears more problematic 
than the estimation of the worth parameters, and 
the average of the simulated estimates is particu- 
larly influenced by some large values, but the median 
shows a better performance. In particular, while the 
average simulated estimates for limited information 
estimation shows a maximum percentage bias equal 
to 44.1%, for the median it reduces to 15.4%. The 
maximum bias for the mean of the simulated esti- 
mates using pairwise likelihood is 16.1%, while for 
the median it is 4%. In both cases, pairwise likeli- 
hood shows lower bias. The standard errors of pair- 
wise likelihood estimates are lower, thus yielding 
shorter confldence intervals. Table 7 reports the em- 
pirical coverage of Wald-type confidence intervals for 



Table 6 

Average (Mn) and median (Md) simulated estimates, average model-based standard errors (s.e.) and 
simulation standard deviations (s.d.) of parameters estimated by maximum likelihood (ML), 
limited information estimation (LI) and pairwise likelihood (PL) 





True 
value 




ML 






LI 








PL 








Mn 


s.e. 


s.d. 


Mn 


Md 


s.e. 


s.d. 


Mn 


Md 


s.e. 


s.d. 




-0.2 


-0.21 


0.19 


0.18 


-0.23 


-0.21 


0.21 


0.22 


-0.22 


-0.20 


0.19 


0.19 


Az 


1 


1.00 


0.30 


0.31 


1.07 


1.07 


0.42 


0.47 


1.03 


1.00 


0.33 


0.33 


As 


-1.5 


-1.51 


0.31 


0.32 


-1.59 


-1.59 


0.49 


0.51 


-1.54 


-1.51 


0.36 


0.35 




1.5 


1.53 


0.83 


0.81 


2.06 


1.58 


1.97 


1.64 


1.70 


1.44 


1.05 


0.95 




4 


3.98 


1.73 


1.75 


5.34 


4.37 


4.42 


4.48 


4.45 


3.92 


2.42 


2.15 




3 


3.01 


1.41 


1.42 


3.91 


3.19 


3.17 


3.25 


3.32 


3.04 


1.93 


1.73 




1 


0.98 


0.70 


0.64 


1.34 


1.06 


1.44 


1.30 


1.12 


0.97 


0.87 


0.77 




1.3 


1.29 


0.73 


0.71 


1.72 


1.39 


1.48 


1.49 


1.43 


1.27 


0.95 


0.84 




2.5 


2.49 


1.09 


1.09 


3.35 


2.72 


2.67 


2.77 


2.77 


2.48 


1.53 


1.33 


b 


0.5 


0.53 


0.41 


0.39 


0.72 


0.58 


0.82 


0.98 


0.58 


0.50 


0.50 


0.51 
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Table 7 

Empirical coverage of confidence intervals for model 
parameters of limited information estimator (LI) 
and pairwise likelihood estimator (PL) at 
nominal levels 95%, 97.5% and 99% 



0.950 0.975 0.990 





LI 


PL 


LI 


PL 


LI 


PL 




0.955 


0.935 


0.981 


0.965 


0.994 


0.983 




0.962 


0.960 


0.973 


0.974 


0.986 


0.986 


As 


0.920 


0.938 


0.941 


0.960 


0.961 


0.977 




0.932 


0.922 


0.947 


0.936 


0.959 


0.966 




0.932 


0.924 


0.949 


0.945 


0.964 


0.961 




0.936 


0.937 


0.949 


0.953 


0.963 


0.964 


5"12 


0.932 


0.937 


0.951 


0.956 


0.966 


0.970 


5"13 


0.915 


0.912 


0.929 


0.933 


0.939 


0.945 


5"23 


0.922 


0.920 


0.941 


0.937 


0.953 


0.951 


b 


0.936 


0.936 


0.946 


0.953 


0.963 


0.963 


the 


estimated 


limited 


information 


estimation 


and 



pairwise likelihood. The coverage rates of the two 
methods are very similar, and in both cases the ac- 
tual coverage for parameters of the covariance ma- 
trix appears systematically lower than the nominal 
levels. In order to obtain accurate coverage probabil- 
ities, we may need to resort to a bootstrap procedure 
for detecting the distribution of the statistic, while 
with pairwise likelihood it may be possible to obtain 
intervals based on the pairwise likelihood function. 

Example. We fit model (3.1) to universities' data 
by means of pairwise likelihood. A full likelihood 
approach based on numeric approximation implies 
computing 303 integrals of dimension 5, in case a uni- 



versity is used as reference object, both for the mean 
and covariance structure, but methods such as the 
Gauss-Hermite quadrature are affected by the curse 
of dimensionality. A multivariate probit approach 
would require a very slow computation because the 
algorithm by Miwa would take very long to approx- 
imate 303 integrals of dimension 15. It is assumed 
that fl = f^^Ii5. Table 8 displays the results of the 
estimates, employing two different sets of constraints. 
The lower triangle of the covariance matrix shown 
in Table 8 reports the estimates obtained using the 
constraints proposed in Maydeu-Olivares and Her- 
nandez (2007); see Section 3.2.2. The estimate of the 
threshold parameter (with standard error in brack- 
ets) is f2 = 0.205 (0.018) while the variance parame- 
ter is cj^ = 0.180 (0.026). A high correlation is es- 
timated between Barcelona and Milan, so strong 
preference for Barcelona is associated with strong 
preference for Milan. Even though some correlations 
do not seem significant, it appears that a strong 
preference for St. Gallen is associated with a weak 
preference for all the other universities but Stock- 
holm. The worth parameters denote the same rank- 
ing of all universities as the one arising from Ta- 
ble 2. However, note that the estimated worth pa- 
rameters cannot be considered as absolute measures 
of worth of items; indeed, it is possible to obtain 
alternative solutions that give an equivalent fitting. 
The mean parameters that can be identified in the 
model are standardized differences, that is, {fii — 

f-'-e)/ \l (^i + f^i ~ 2ai6 + w^, i = l,...,5, where /ie 
and (Tg are the mean and variance of the latent vari- 
able referring to Stockholm, the reference university. 



Table 8 

Estimates and standard errors (in brackets) of mean and correlation parameters of model (3.1) for universities data 
using constraints proposed by Maydeu-Olivares and Hernandez (2007). In italics the estimates and 
standard errors of a model with fixed correlation between Paris and St. Gallen 





Barcelona 


London 


Milan 


Paris 


St. Gallen 


Stockholm 




Barcelona 


1 


-0.064 


0.688 


0.063 


-0.472 


0.265 


0.405 




(fixed) 


(0.183) 


(0.085) 


(0.158) 


(0.146) 


(0.145) 


(0.073) 


London 


0.058 


1 


0.079 


-0.069 


-0.287 


0.227 


1.346 




(0.084) 


(fixed) 


(0.18.5) 


(0.224) 


(0.147) 


(0.154) 


(0.087) 


Milan 


0.724 


0.185 


1 


0.244 


-0.466 


0.253 


0.308 




(0.062) 


(0.097) 


(fixed) 


(0.174) 


(0.137) 


(0.160) 


(0.074) 


Paris 


0.171 


0.054 


0.331 


1 


-0.690 


0.033 


0.748 




(0.094) 


(0.117) 


(0.113) 


(fixed) 


(fixed) 


(0.267) 


(0.086) 


St. Gallen 


-0.303 


-0.139 


-0.298 


-0.496 


1 


0.194 


0.371 




(0.113) 


(0.139) 


(0.144) 


(0.157) 


(fixed) 


(0.135) 


(0.081) 


Stockholm 


0.350 


0.316 


0.339 


0.144 


0.287 


1 







(0.079) 


(0.091) 


(0.097) 


(0.113) 


(0.130) 


(fixed) 


(fixed) 



16 



M. CATTELAN 



From the identified parameters, different covariance 
matrixes of the universities can be recovered. For ex- 
ample, in this instance where the matrix S-p can be 
interpreted as a correlation matrix, it is shown that 
the worth parameters -v/c/i, the correlation matrix 
cSt + (1 — c)ll' and the covariance matrix of the 
pair-specific errors cfi produce the same fitting of 
the model for a positive constant c such that the cor- 
relation matrix remains positive definite (Maydeu- 
Olivares and Hernandez (2007)). It is possible to set 
one of the parameters of the correlation matrix ac- 
cording to some assumption, for example we may 
presume that a strong preference for Paris is asso- 
ciated with a weak preference for St. Gallen, and 
determine the value of c which minimizes the cor- 
relation between the two universities while yielding 
a positive definite correlation matrix. The value is 
c = 1.13 which produces a correlation between Paris 
and St. Gallen equal to —0.690. The estimates of 
the correlation matrix with this fixed value of cor- 
relation between Paris and St. Gallen are shown in 
the upper triangle of the matrix in Table 8. The 
worth parameters can be computed by multiplying 
the estimates shown in Table 8 by \/1.13. The fitting 
of the two models is equal, but in the second case 
estimation is based on some previous theory about 
correlation between a certain couple of universities. 

This analysis has only an illustrative purpose, in 
particular Bockenholt (2001b) finds that a model 
with thresholds that vary among subjects performs 
better than a model with a constant threshold pa- 
rameter. 

3.4.3 Model selection and goodness of fit Paired 
comparison data can be arranged in a contingency 
table. In case of multiple judgment sampling the 
data can be arranged in a table of dimension 2^ 
when there are two possible outcomes and H'^ when 
the outcomes are if-categorical. As a result, the con- 
tingency table will typically be very sparse, espe- 
cially if covariates are included so that paired com- 
parisons are observed conditional on the values of 
the covariates. In this situation the likelihood ra- 
tio statistic and the Pearson statistic do not follow 
a distribution, nevertheless these statistics are of- 
ten employed to assess the model and for model se- 
lection. Differences between observed and expected 
frequencies for subsets of the data, as the 2x2 sub- 
tables or triplets of comparisons, are sometimes con- 
sidered in order to identify where the fitting of the 
model is not good. In Dittrich et al. (2007) the de- 
viance is used for selection between nested models. 



but the test of goodness of fit cannot be based on the 
asymptotic distribution so a Monte Carlo proce- 
dure is employed. 

Since the goodness of fit of the model cannot be 
assessed through the usual statistics and Monte Car- 
lo procedures are computationally expensive, some 
statistics based on lower dimensional marginals of 
the contingency table have been proposed. In gen- 
eral the statistics proposed are quadratic forms of 
the residuals 

(3.5) {pr - 7r^(^)}'C{pr - 7rr(V')}, 

where C is a weight matrix, p,. denotes the sample 
marginal proportions and r denotes a set of lower 
order marginals. 

Maydeu-Olivares (2001) considers the statistic G 
as in (3.4) employed for estimation, which corre- 
sponds to setting C = W in (3.5) and r denoting 
univariate and bivariate marginal probabilities. The 
statistic SG is analyzed in order to test Hq : k = 

When W = H , then SG -4 Xd where d = 
N{N + 1) /2 — q and q is the number of model param- 
eters. However, when W= [diag(3)]~^ or W = /, 
the asymptotic distribution of the statistic is a weight- 
ed sum of d chi-square random variables with one 
degree of freedom. Maydeu-Olivares (2001) proposes 
to rescale the test statistic in order to match the 
asymptotic chi-square distribution. The same proce- 
dure is followed in the proposal for testing Hq : 7V2 = 

("0), where 7T2 is the vector of all univariate and 
bivariate marginal probabilities. Maydeu-Olivares 
(2006) considers the testing of further hypotheses 
but the issue of the asymptotic distribution being 
a weighted sum of chi-square distributions remains. 

Maydeu-Olivares and Joe (2005) consider testing 
the hypothesis Hq itt = 7r(i/>) in a multidimensional 
contingency table, where tt is the 2^-dimensional 
vector of joint probabilities. Again, the use of mar- 
ginal residuals up to order r is considered. Let tt 
denote a vector which stacks all the marginal prob- 
abilities: univariate, bivariate, trivariate and so on. 
There is a one-to-one correspondence between tt and 
TT so that for a particular matrix A of O's and I's 
TT = Att. If only marginal probabilities up to order r 
are considered, then tt^ = A^tt for a sub-matrix A^ 
of A. Let A = div/dtp and F = E — tttt', where 
E = diag(7r). Maydeu-Olivares and Joe (2005) pro- 
pose the statistic 

(3.6) Mr = S{pr - 7r,(-0)}'a('0){p, - TVrW}, 

where Cr{^) = - A,.(A;F-i A^)"! A;F-i, 
Fr = ArTA'j. and A^ = A^, A. Mr is asymptotically 
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distributed as a xf-q random variable where / is the 
length of Pr- The Mr statistic asymptotically fol- 
lows a chi-square distribution not only when -0 is 
the maximum likelihood estimator, but also when it 
is a 'v/S'-consistent estimate, such as the limited in- 
formation estimator and the pairwise likelihood esti- 
mator presented in Section 3.4.1. Since the marginals 
should not be sparse, Maydeu-Olivares and Joe (2005) 
suggest to use M2 when the model is identified using 
only univariate and bivariate information, also be- 
cause only up to bivariate sample moments and four- 
way model probabilities are involved in the compu- 
tation of M2. As the number of cells gets larger, the 
dimension of the matrices involved in (3.6) increases 
noticeably, and tricks may be necessary to do the 
computations. Analysis and extensions of this type 
of test are considered in Maydeu-Olivares and Joe 
(2006), Reiser (2008) and Joe and Maydeu-Olivares 
(2010). All applications considered regard item re- 
sponse theory, so an investigation of their perfor- 
mance in paired comparison data is necessary to 
understand the sample size needed for obtaining ac- 
curate Type I errors using M2. 

4. SOFTWARE 

Fitting models to paired comparison data is facil- 
itated by some R packages which allow fitting of the 
classical models and, in some cases, also fitting of 
more complicated models. 

The eba package (Wickelmaier and Schmid (2004)) 
fits elimination by aspects models (Tversky (1972)) 
to paired comparison data. The elimination by as- 
pects model assumes that different objects present 
various aspects. The worth of each object is the 
sum of the worth associated with each aspect pos- 
sessed by the object. When all objects possess only 
one relevant aspect, then the elimination by aspects 
model reduces to the Bradley-Terry model. There- 
fore, in case only one aspect per object is specified, 
the function eba can be used to fit model (2.1) with 
logit link, while when the link is probit the function 
thurstone can be used. The function strans checks 
how many violations of weak, moderate and strong 
stochastic transitivity are present in the data. 

The pref mod package (Hatzinger (2010)) fits Brad- 
ley-Terry models exploiting their log-linear repre- 
sentation. Ordinal paired comparisons are allowed, 
but the software reduces the total number of cate- 
gories to three or two, depending on whether there 
is a no preference category or not. 

There are three different functions for estimating 
models for paired comparison data: the llbt.fit 



function which estimates the log-linear version of the 
Bradley-Terry model through the estimation algo- 
rithm described in Hatzinger and Francis (2004), the 
llbtPC.fit function that estimates the log-linear 
model exploiting the gnm (Turner and Firth (2010b)) 
function for fitting generalized nonlinear models and 
the pattPC.f it function, which fits paired compar- 
ison data using a pattern design, that is, all possible 
patterns of paired comparisons. The latter function 
handles also some cases in which the responses are 
missing not at random; see Section 5. A difficulty of 
this approach is that the response table grows dra- 
matically with the number of objects since, in case of 
only two possible outcomes, the number of patterns 
is 2^, so no more than six objects can be included 
with two response categories, and not more than five 
with three response categories. Finally, the function 
pattnpml .fit fits a mixture model to overdispersed 
paired comparison data using nonparametric maxi- 
mum likelihood. 

The BradleyTerry2 package (Turner and Firth 
(2010a)) expands the previous BradleyTerry (Firth 
(2008)) package and allows one to fit the unstruc- 
tured model (2.1) and extension (2.3) with logit, 
probit and cauchit link functions, including also com- 
parison-specific covariates. Model fitting is either by 
maximum likelihood, penalized quasi-likelihood or 
bias-reduced maximum likelihood (Firth (1993)). In 
case of object specific random effects, as in model 
(3.3), penalized quasi-likelihood (Breslow and Clay- 
ton (1993)) is used, while when an object wins or 
loses all the paired comparisons in which it is in- 
volved and its estimate worth parameter is infinite, 
then the bias-reduced maximum likelihood produces 
finite estimates. If there are missing explanatory vari- 
ables, an additional worth parameter for the object 
with missing covariates is estimated. Order effects 
and more general comparison-specific covariates can 
be included, but only win-loss responses are allowed. 

The package psychotree (Strobl, Wickelmaier and 
Zeileis (2011)) implements the method for recursive 
partitioning of the subjects on the basis of their ex- 
planatory variables and estimates an unstructured 
Bradley-Terry model for each of the final subgroups 
of subjects; see Section 2.4. 

Although the available packages have many use- 
ful features, a combination of those provided by the 
different packages and also some additional features 
could be of practical help. The prefmod and 
BradleyTerry2 packages were built with the aim of 
analyzing multiple judgment data and tournament- 
like data, respectively. This is reflected in the dif- 



18 



M. CATTELAN 



ferent characteristics of the packages. A function 
that can handle data with at least three-categorical 
results, thus allowing for the "no preference" cat- 
egory, include different link functions, and an easy 
implementation of object-, subject- and comparison- 
specific covariates in a linear model framework would 
be useful. The available methods for including de- 
pendencies between observations are only in a log- 
linear framework through the introduction of fur- 
ther parameters in the predictor or including object- 
related random effects, which are estimated by means 
of penalized quasi likelihood, a method that does not 
perform well with binary data. At present, there are 
no available packages for the analysis of paired com- 
parison data that allow the fitting of models as those 
presented in Section 3.2.1. However, implementation 
of pairwise likelihood estimation for those models is 
straightforward since it implies only the computa- 
tion of bivariate normal probabilities. 

5. CONCLUSIONS 

This paper reviews some of the extensions pro- 
posed in the literature to the two most commonly 
applied models for paired comparison data, namely 
the Bradley-Terry and the Thurstone models. How- 
ever, not every aspect could be considered here, and 
among issues that have not been treated, there are 
the development of models for multi-dimensional da- 
ta when objects are evaluated with respect to multi- 
ple aspects (Bockenholt (1988); Dittrich et al. (2006)), 
the temporal extension for comparisons repeated in 
time (Fahrmeir and Tutz (1994), Glickman (2001), 
Bockenholt (2002), Dittrich, Francis and Katzen- 
beisser (2008)), the estimation of abilities of individ- 
uals belonging to a team that performs the paired 
comparisons (Huang, Weng and Lin (2006); Menke 
and Martinez (2008)) and many more. Another im- 
portant issue concerns the optimal design of the ex- 
periment. GraBhoff et al. (2004) show that the mini- 
mum sample size required for maximizing the deter- 
minant of the information matrix in an unstructured 
Bradley-Terry model requires that every compari- 
son is performed once. When objects are specified 
using factors with a certain number of levels, the 
required sample size grows exponentially, while the 
number of parameters grows linearly as the num- 
ber of factors increases. Some designs, in order to 
reduce the number of required comparisons, are in- 
vestigated in Grafihoff et al. (2004). In Grafihoff 
and Schwabe (2008) a characterization of the lo- 
cally optimal design in case of two factors design in 



a Bradley-Terry model is given, but for more com- 
plex situations it seems difficult to give general re- 
sults. Goos and Grossmann (2011) consider also the 
problem when within-pair order effects are present. 
It seems that investigation of these issues in other 
models are not present in the literature. 

The methods for independent data are well estab- 
lished, and a lot of literature has been published 
about them. The problem of the asymptotic behav- 
ior of the maximum likelihood estimator has been 
tackled. The case of a fixed number of objects and 
increasing number of comparisons per couple does 
not seem to pose particular difficulties for standard 
arguments, while more problematic appears the in- 
stance of a fixed number of comparisons per cou- 
ple and increasing number of items. In the context 
of the unstructured Bradley-Terry model, Simons 
and Yao (1999) find a condition on the growth rate 
of the largest ratio between item worth parameters 
which assures that the maximum likelihood estima- 
tor is consistent and asymptotically normally dis- 
tributed. Yan, Yang and Xu (2012) investigate the 
case in which the number of comparisons per cou- 
ple is not fixed, and some comparisons may also be 
missing, and find a condition that assures normality 
of the maximum likelihood estimator. We are not ac- 
quainted with any other investigation of asymptotic 
behavior of estimators in models different from the 
unstructured, independent Bradley-Terry model. 

Particular attention has been focused on mod- 
els for dependent data. Thurstonian models appear 
particularly suitable to account for dependence be- 
tween observations. However, the problems posed 
by the identification restrictions are noticeable. The 
estimated model has to be interpreted with refer- 
ence to a class of covariance matrices, and differ- 
ent identification restrictions may lead to different 
class of matrices. It is possible to rotate the ma- 
trix according to a predefined hypothesis about the 
covariance between certain items (Maydeu-Olivares 
and Hernandez (2007)), but the estimated standard 
errors vary depending on the fixed parameters and 
the significance of the other estimated parameters 
changes. 

In the multiple judgment sampling scheme it is of- 
ten stated that if a judge does not perform all paired 
comparisons, then it suffices to define subject-specific 
matrices (see Section 3.2.1) with rows corre- 
sponding only to the comparisons performed by 
judge s. However, it is expected that this may be 
problematic for estimation by means of limited in- 
formation estimation, and there are no studies about 
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the consequences of missing data in this estimation 
method. 

Missing observations cause problems also for test- 
ing the goodness of fit since quadratic statistics as 
(3.5) assume that all comparisons are performed by 
all subjects. 

Missing data may derive from the design of the ex- 
periment, for example when n is very large, and only 
a subset of all comparisons is presented to each sub- 
ject. Otherwise, if many comparisons are performed 
by the same subject it may be necessary to account 
for the fatigue of subjects and/or for the passing 
of time when comparisons take long in order to be 
accomplished. 

Dittrich et al. (2012) consider the problem of miss- 
ing data in the context of the log-linear representa- 
tion of the Bradley-Terry model since the study of 
the missing mechanism may shed light on the psy- 
chological process. It is assumed that the probability 
that a comparison is missing follows a logistic distri- 
bution since this facilitates the fitting of the model. 
However, the likelihood for such models is not easy 
to compute, and the function in the prefmod pack- 
age allows one to compute it only for data with up 
to six objects. It is not easy to discriminate between 
different types of missing mechanisms, and a very 
large number of observations may be needed in or- 
der to discriminate between a missing completely at 
random and missing not at random situation. 

The economic theory points out some problems 
in choice data that have not been considered yet. 
The main aspects which may need to be incorpo- 
rated in models include the influence that subjects 
can have on each other, the influence of one partic- 
ular subject, that may be some sort of leader, over 
all the other judges and the dependence on choices 
caused by the social and cultural context. Inclusions 
of these aspects will inevitably lead to even more 
complicated models for paired comparison data. 

Finally, methods for object-related dependencies 
present many open problems. Most of the issues are 
connected to the dependence among all comparisons 
which is typically present in this context. Moreover, 
the scheme of paired comparisons is often much less 
balanced than in psychometric experiments. Asymp- 
totic theory in models for dependent data when the 
number of items compared increases has not been 
developed yet. Maximum pairwise likelihood esti- 
mation provided encouraging results, but more ex- 
tensive studies seem necessary. In this case, compu- 
tation of standard errors is problematic since there 
are no independent replications of the data, so a vi- 



able alternative lies in parametric bootstrap. Meth- 
ods for model selection and goodness of fit described 
in Section 3.4.3 require independent replication of 
all comparisons; hence they cannot be employed in 
this setting. 
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